A professor/researcher using R for more than 10 years and who developed more than 10 R packages.
The course will be organized as follows:
R is one of the most used softwares for Data Science:
But there are of course good alternatives:
R is very good for DS because of:
To start a project in R:
Two main commands:
run is devoted to execute line by line the codesource will execute the whole script at onceNote: the Base R Cheat Sheet can be downloaded on the Rstudio website: https://www.rstudio.com/resources/cheatsheets/
R can be used to make simple (and less simple) caclulations:
2+2
## [1] 4
exp(-2)
## [1] 0.1353353
All calculations are made using vectors (or matrices later) and the [1] is a trace of this vector nature
rnorm(15)
## [1] -0.63614366 -1.79161413 -0.68138714 -1.11974437 0.68346521
## [6] 0.07838463 -1.39671431 3.47389963 -0.23125699 -1.24024898
## [11] 0.85310013 -0.01432660 -2.14595092 1.27271347 1.47870982
This command produces a vector of 15 values and the number within brackets at the beginning of the lines indicate the rank of the values in the vector.
x = 2
It is possible to store the results in variables. The = operator assigns the value to the variable. When all is ok, R does not display anything!
It is also possible to use the arrow operator -> which usually works in the same way as the = operator
a = c(1,2,3,4,5)
The c() function allows to create vectors. Warning: all vectors must be of the same type (numeric, characters, booleans)!
This is a vector of booleans:
a = c(TRUE,FALSE,TRUE,FALSE,FALSE)
All operators are working on vectors elements by elements:
a = c(1,2,3,4,5)
b = c(2,1,4,6,1)
a / b
## [1] 0.5000000 2.0000000 0.7500000 0.6666667 5.0000000
The resulting vector is of the same size of the two input vectors.
Note: the operator for matrix multiplication is %*%
c = a / b
c^2
## [1] 0.2500000 4.0000000 0.5625000 0.4444444 25.0000000
Here, the expentation is again element by element.
It is usually very easy to translate Maths into code:
xbar = sum(a) / length(a)
xbar
## [1] 3
The R documentation is very well structured and it is easy to find information from it. A search in the documentation is usually in two steps:
??student
?t.test
The documentation is always organised in the same manner:
Description: a short description in English of the method implemented in the functionUsage: the way the function should be used, with all input parameters and their default valuesArguments: a detailed explenation of the input parametersValues: a detailed explenation of the output fieldsThere are plenty of already coded functions:
mean(a)
## [1] 3
sd(a)
## [1] 1.581139
range(a)
## [1] 1 5
max(a)
## [1] 5
Note: missing values are represented in R with the
NAtype. It is possible to locate missing values using theis.na()function.
In R, the generic plot function allows to output a graphic from some object:
a = c(1,2,3,4,5)
b = c(2,1,4,6,1)
plot(b,a)
plot(b,a,pch=19,col='lightblue',cex=2,type='b')
title(main='My wonderful title')
Note: a plot is build like a Lego, meaning it is not possible to mify something whitout restarting from zero!
Two options:
How to proceed for CSV:
# Load the data into X
X = read.csv('path/toward/thefile.csv')
# Save some data into a file
write.csv(X,file='path/toward/thefile2.csv')
How to proceed for Rdata:
# Load the data
load('path/toward/thefile.Rdata')
# Save some data into a file
save(X,Y,a,b,out,file='path/toward/thefile2.Rdata')
Note: the
rm(list=ls())allows to clean the working space.
Write a script allowing to load a vector file and “remove” the missing values.
# Load data
x = read.csv("path/to/myfile.csv")
x = as.vector(x$x) # to ensure that x is a vector
# Count the number of NAs
prop = sum(is.na(x)) / lenght(x)
# If too many NAs, remove them
if (prop > 0.05){
x = x[-which(is.na(x))]
}
# Otherwise, replace NAs with the mean
if (prop <= 0.05){
m = mean(x,na.rm = TRUE)
x[which(is.na(x))] = m
}
First, it important to understand that R has two kinds of “objects”:
To differentiate between functions and objects, functions have to be called every time with brackets ()!
plot = c(1,1,2,3,4,5)
plot(plot)
About the functions, the arguments (imput parameters) can be provided in any order:
x = c(1,2,3,4,5,6)
y = sin(x)
plot(x,y,type='b')
or
plot(y=y,x=x,col='red',type='b')
if you declare clearly which argument you are providing!
Vectors can be created with the c() function:
x = c(1,2,3,4,5)
x = c(TRUE,FALSE,TRUE)
x = c('tyty',"tutu",'yuyu')
or some other ways:
x = seq(1,10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
x = seq(1,11,by = 1.5)
x
## [1] 1.0 2.5 4.0 5.5 7.0 8.5 10.0
x = seq(1,10,length.out = 100)
x
## [1] 1.000000 1.090909 1.181818 1.272727 1.363636 1.454545 1.545455
## [8] 1.636364 1.727273 1.818182 1.909091 2.000000 2.090909 2.181818
## [15] 2.272727 2.363636 2.454545 2.545455 2.636364 2.727273 2.818182
## [22] 2.909091 3.000000 3.090909 3.181818 3.272727 3.363636 3.454545
## [29] 3.545455 3.636364 3.727273 3.818182 3.909091 4.000000 4.090909
## [36] 4.181818 4.272727 4.363636 4.454545 4.545455 4.636364 4.727273
## [43] 4.818182 4.909091 5.000000 5.090909 5.181818 5.272727 5.363636
## [50] 5.454545 5.545455 5.636364 5.727273 5.818182 5.909091 6.000000
## [57] 6.090909 6.181818 6.272727 6.363636 6.454545 6.545455 6.636364
## [64] 6.727273 6.818182 6.909091 7.000000 7.090909 7.181818 7.272727
## [71] 7.363636 7.454545 7.545455 7.636364 7.727273 7.818182 7.909091
## [78] 8.000000 8.090909 8.181818 8.272727 8.363636 8.454545 8.545455
## [85] 8.636364 8.727273 8.818182 8.909091 9.000000 9.090909 9.181818
## [92] 9.272727 9.363636 9.454545 9.545455 9.636364 9.727273 9.818182
## [99] 9.909091 10.000000
This would be very useful to get a nice plot of the sinus function.
x = seq(-pi,pi,length.out = 100)
plot(x,sin(x),type='l')
You can also use the : operator which equivalent to the default seq() function.
x = seq(1:10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
y = 1:10
y
## [1] 1 2 3 4 5 6 7 8 9 10
w = 0.5:10
w
## [1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
And the last way of creating vectors is to use the rep function.
x = rep(0,10)
x
## [1] 0 0 0 0 0 0 0 0 0 0
This is useful in particular to initialize a vector which will be used in a loop or a function.
Tip: always initialize your vectors with a “strange” value for the task (for instance NA’s)
m = rep(NA,10)
for (i in 1:10) m[i] = mean(rnorm(10))
m
## [1] -0.40058418 -0.32261605 -0.12701528 -0.02434036 -0.29617637
## [6] -0.52112056 -0.23050035 0.24287414 0.10853839 0.58870651
Matrices can be created with the matrix function:
A = matrix(c(1,2,3,4,5,6),nrow = 2,byrow = TRUE)
A
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
Another option would be to use the cbind and rdind functions:
A = cbind(c(1,2),c(2,3),c(4,5))
A
## [,1] [,2] [,3]
## [1,] 1 2 4
## [2,] 2 3 5
A = rbind(c(1,2,3),c(4,5,6))
A
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
To access vectors and matrices, we cannot use the barckets ()! Instead, we use the square brackets [...]
x = seq(1,10,length.out = 25)
# Access the 5th element of x
x[5]
## [1] 2.5
# Access all elements between the 5th and the 15th elements
x[5:15]
## [1] 2.500 2.875 3.250 3.625 4.000 4.375 4.750 5.125 5.500 5.875 6.250
# Access all elements except those between 5 and 15
y = x[-(5:15)]
y
## [1] 1.000 1.375 1.750 2.125 6.625 7.000 7.375 7.750 8.125 8.500
## [11] 8.875 9.250 9.625 10.000
length(y)
## [1] 14
With the square brackets, you can also modify the values of a vector:
x[5] = NA
x
## [1] 1.000 1.375 1.750 2.125 NA 2.875 3.250 3.625 4.000 4.375
## [11] 4.750 5.125 5.500 5.875 6.250 6.625 7.000 7.375 7.750 8.125
## [21] 8.500 8.875 9.250 9.625 10.000
It is also possible to do some conditional access:
x[is.na(x)] = -1
x
## [1] 1.000 1.375 1.750 2.125 -1.000 2.875 3.250 3.625 4.000 4.375
## [11] 4.750 5.125 5.500 5.875 6.250 6.625 7.000 7.375 7.750 8.125
## [21] 8.500 8.875 9.250 9.625 10.000
x[x < 4] = 0
x
## [1] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 4.000 4.375
## [11] 4.750 5.125 5.500 5.875 6.250 6.625 7.000 7.375 7.750 8.125
## [21] 8.500 8.875 9.250 9.625 10.000
x[x < 4 | x > 8] = NA
x
## [1] NA NA NA NA NA NA NA NA 4.000 4.375 4.750
## [12] 5.125 5.500 5.875 6.250 6.625 7.000 7.375 7.750 NA NA NA
## [23] NA NA NA
Note: the
|operator denotes a OR. The AND is denoted by the&operator.
For matrices, you will have to provide position information about rows and columns:
A = diag(1,5)
A
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 1 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
A[1:2,4:5]
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
If you would like to access all rows (resp. columns) of a matrix, you can use the “nothing” operator:
A[,4:5]
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
## [3,] 0 0
## [4,] 1 0
## [5,] 0 1
In some cases, you perhaps would like to access some specific locations:
A[c(1,3),c(2,5)]
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
Lists are objects able to store information of different types. All functions return lists, so it is important to be able to manipulate them!
mylist = list(name='Charles',age=28,married=TRUE)
mylist
## $name
## [1] "Charles"
##
## $age
## [1] 28
##
## $married
## [1] TRUE
mylist$name = 'Tutu'
mylist
## $name
## [1] "Tutu"
##
## $age
## [1] 28
##
## $married
## [1] TRUE
mylist$Nb_of_cats = 4
mylist
## $name
## [1] "Tutu"
##
## $age
## [1] 28
##
## $married
## [1] TRUE
##
## $Nb_of_cats
## [1] 4
out = t.test(1:10, y = c(7:20))
out
##
## Welch Two Sample t-test
##
## data: 1:10 and c(7:20)
## t = -5.4349, df = 21.982, p-value = 1.855e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.052802 -4.947198
## sample estimates:
## mean of x mean of y
## 5.5 13.5
print.default(out)
## $statistic
## t
## -5.43493
##
## $parameter
## df
## 21.98221
##
## $p.value
## [1] 1.855282e-05
##
## $conf.int
## [1] -11.052802 -4.947198
## attr(,"conf.level")
## [1] 0.95
##
## $estimate
## mean of x mean of y
## 5.5 13.5
##
## $null.value
## difference in means
## 0
##
## $alternative
## [1] "two.sided"
##
## $method
## [1] "Welch Two Sample t-test"
##
## $data.name
## [1] "1:10 and c(7:20)"
##
## attr(,"class")
## [1] "htest"
out$statistic
## t
## -5.43493
sqrt(abs(out$statistic))
## t
## 2.331294
out$p.value
## [1] 1.855282e-05
The last object is the data.frame and it is devoted to store data sets. In a data.frame, rows are for individuals and columns are for variables.
# We create a data.frame with 28 individuals and 3 variables (2 numeric and 1 categorical)
X = data.frame(x = seq(1,10,length.out = 28),
y = seq(-3,3,length.out = 28),
sex = c(rep('M',10),rep('F',18)))
X
## x y sex
## 1 1.000000 -3.0000000 M
## 2 1.333333 -2.7777778 M
## 3 1.666667 -2.5555556 M
## 4 2.000000 -2.3333333 M
## 5 2.333333 -2.1111111 M
## 6 2.666667 -1.8888889 M
## 7 3.000000 -1.6666667 M
## 8 3.333333 -1.4444444 M
## 9 3.666667 -1.2222222 M
## 10 4.000000 -1.0000000 M
## 11 4.333333 -0.7777778 F
## 12 4.666667 -0.5555556 F
## 13 5.000000 -0.3333333 F
## 14 5.333333 -0.1111111 F
## 15 5.666667 0.1111111 F
## 16 6.000000 0.3333333 F
## 17 6.333333 0.5555556 F
## 18 6.666667 0.7777778 F
## 19 7.000000 1.0000000 F
## 20 7.333333 1.2222222 F
## 21 7.666667 1.4444444 F
## 22 8.000000 1.6666667 F
## 23 8.333333 1.8888889 F
## 24 8.666667 2.1111111 F
## 25 9.000000 2.3333333 F
## 26 9.333333 2.5555556 F
## 27 9.666667 2.7777778 F
## 28 10.000000 3.0000000 F
Data.frame can be used as matrices or lists:
X[5:10,]
## x y sex
## 5 2.333333 -2.111111 M
## 6 2.666667 -1.888889 M
## 7 3.000000 -1.666667 M
## 8 3.333333 -1.444444 M
## 9 3.666667 -1.222222 M
## 10 4.000000 -1.000000 M
X$sex
## [1] M M M M M M M M M M F F F F F F F F F F F F F F F F F F
## Levels: F M
X[,3]
## [1] M M M M M M M M M M F F F F F F F F F F F F F F F F F F
## Levels: F M
summary(X)
## x y sex
## Min. : 1.00 Min. :-3.0 F:18
## 1st Qu.: 3.25 1st Qu.:-1.5 M:10
## Median : 5.50 Median : 0.0
## Mean : 5.50 Mean : 0.0
## 3rd Qu.: 7.75 3rd Qu.: 1.5
## Max. :10.00 Max. : 3.0
Some useful functions to get information about objects:
#Length of a vector
length(x)
## [1] 25
# Length of a list (nb of fields)
length(mylist)
## [1] 4
# For matrices and data.frames
nrow(X)
## [1] 28
ncol(X)
## [1] 3
# For data.frames
colnames(X)
## [1] "x" "y" "sex"
rownames(X)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
It is also possible to modify the names of variables and individuals
colnames(X) = c('Var 1','Var 2','Var 3')
X
## Var 1 Var 2 Var 3
## 1 1.000000 -3.0000000 M
## 2 1.333333 -2.7777778 M
## 3 1.666667 -2.5555556 M
## 4 2.000000 -2.3333333 M
## 5 2.333333 -2.1111111 M
## 6 2.666667 -1.8888889 M
## 7 3.000000 -1.6666667 M
## 8 3.333333 -1.4444444 M
## 9 3.666667 -1.2222222 M
## 10 4.000000 -1.0000000 M
## 11 4.333333 -0.7777778 F
## 12 4.666667 -0.5555556 F
## 13 5.000000 -0.3333333 F
## 14 5.333333 -0.1111111 F
## 15 5.666667 0.1111111 F
## 16 6.000000 0.3333333 F
## 17 6.333333 0.5555556 F
## 18 6.666667 0.7777778 F
## 19 7.000000 1.0000000 F
## 20 7.333333 1.2222222 F
## 21 7.666667 1.4444444 F
## 22 8.000000 1.6666667 F
## 23 8.333333 1.8888889 F
## 24 8.666667 2.1111111 F
## 25 9.000000 2.3333333 F
## 26 9.333333 2.5555556 F
## 27 9.666667 2.7777778 F
## 28 10.000000 3.0000000 F
rownames(X) = paste('Individual',1:28)
X
## Var 1 Var 2 Var 3
## Individual 1 1.000000 -3.0000000 M
## Individual 2 1.333333 -2.7777778 M
## Individual 3 1.666667 -2.5555556 M
## Individual 4 2.000000 -2.3333333 M
## Individual 5 2.333333 -2.1111111 M
## Individual 6 2.666667 -1.8888889 M
## Individual 7 3.000000 -1.6666667 M
## Individual 8 3.333333 -1.4444444 M
## Individual 9 3.666667 -1.2222222 M
## Individual 10 4.000000 -1.0000000 M
## Individual 11 4.333333 -0.7777778 F
## Individual 12 4.666667 -0.5555556 F
## Individual 13 5.000000 -0.3333333 F
## Individual 14 5.333333 -0.1111111 F
## Individual 15 5.666667 0.1111111 F
## Individual 16 6.000000 0.3333333 F
## Individual 17 6.333333 0.5555556 F
## Individual 18 6.666667 0.7777778 F
## Individual 19 7.000000 1.0000000 F
## Individual 20 7.333333 1.2222222 F
## Individual 21 7.666667 1.4444444 F
## Individual 22 8.000000 1.6666667 F
## Individual 23 8.333333 1.8888889 F
## Individual 24 8.666667 2.1111111 F
## Individual 25 9.000000 2.3333333 F
## Individual 26 9.333333 2.5555556 F
## Individual 27 9.666667 2.7777778 F
## Individual 28 10.000000 3.0000000 F
X["Individual 26",]
## Var 1 Var 2 Var 3
## Individual 26 9.333333 2.555556 F
Exercise: Create two vectors of different dimensions and insert the second one in the first one between the 2nd and 3rd elements.
a = c(1,2,3,4,5)
b = runif(10)
c = c(a[1:2],b,a[-(1:2)])
c
## [1] 1.00000000 2.00000000 0.48180683 0.13411387 0.42401444 0.96252263
## [7] 0.37654661 0.73015810 0.86870886 0.04555321 0.74810374 0.38847350
## [13] 3.00000000 4.00000000 5.00000000
Exercise: Draw 100 numbers from a Uniform distribution on [0,1] and count how many values are larger than 0.5
x = runif(100,min = 0,max = 1)
sum(x>=0.5)
## [1] 50
length(x[x>=0.5])
## [1] 50
Exercise: Draw 100 couples \((x,y)\) randomly in the square \([0,1]^2\) (use a data.frame to store the data). Count how many couples are within the square \([0.5,1]^2\). Plot the couples and display with a red point the mean of all points.
X = data.frame(x = runif(100), y = runif(100))
sum(X$x >= 0.5 & X$y >= 0.5)
## [1] 24
plot(X,col='lightblue',pch=19)
mean = colMeans(X)
points(mean[1],mean[2],col='red',pch=19,cex=2)
Exercise: It is Spring! We consider a field of flower bulbes. We have 100 bulbes in a field of 1m x 1m. We know that the probability that a bulbe blooms is 0.85 (use a data.frame to store the data). Simulate the blossoming of the bulbes!
X = data.frame(x = runif(100), y = runif(100),
b = rbinom(100,size = 1,prob = 0.85))
X
## x y b
## 1 0.482385463 0.95272036 0
## 2 0.935273987 0.48938969 1
## 3 0.833426535 0.36876173 1
## 4 0.491739545 0.21967751 1
## 5 0.512224207 0.22175255 1
## 6 0.169196047 0.81228363 1
## 7 0.798336988 0.41465905 1
## 8 0.915688189 0.52225389 1
## 9 0.901797233 0.97200626 1
## 10 0.594102430 0.39050110 1
## 11 0.709332747 0.17575912 1
## 12 0.668688073 0.74051950 1
## 13 0.190941632 0.92416120 1
## 14 0.123592664 0.55418842 1
## 15 0.270830068 0.89318131 1
## 16 0.808378143 0.62003766 1
## 17 0.599745586 0.68595147 1
## 18 0.827772828 0.93299986 1
## 19 0.794061270 0.61640828 1
## 20 0.651917435 0.19099553 1
## 21 0.154867405 0.42605300 0
## 22 0.759520175 0.76551816 1
## 23 0.746298232 0.16841549 1
## 24 0.738623844 0.04501963 1
## 25 0.104805069 0.96593480 1
## 26 0.854276683 0.92916019 1
## 27 0.547584553 0.08269657 1
## 28 0.014208163 0.23111107 1
## 29 0.002305876 0.97891846 1
## 30 0.311017394 0.09976388 1
## 31 0.136308048 0.68807202 1
## 32 0.515896664 0.67555061 1
## 33 0.454988125 0.63667950 1
## 34 0.437051728 0.90632182 1
## 35 0.166306296 0.29200164 1
## 36 0.743982983 0.87407347 1
## 37 0.822936576 0.95302615 1
## 38 0.359468095 0.26662019 1
## 39 0.785042912 0.01828725 1
## 40 0.395148351 0.06564920 1
## 41 0.690722077 0.92359795 1
## 42 0.797622001 0.93304227 1
## 43 0.717520420 0.79626286 1
## 44 0.061249074 0.75552193 1
## 45 0.658153805 0.98950556 1
## 46 0.475078070 0.51126781 1
## 47 0.484872019 0.69078698 1
## 48 0.366873633 0.28560776 1
## 49 0.214307128 0.15989784 1
## 50 0.225645179 0.38124649 1
## 51 0.584252419 0.48223083 1
## 52 0.558898341 0.58456680 1
## 53 0.650039612 0.73270031 0
## 54 0.499680690 0.69274960 0
## 55 0.880561361 0.52353401 1
## 56 0.590076074 0.42514644 1
## 57 0.265238620 0.03443015 1
## 58 0.217237896 0.49831299 1
## 59 0.669885437 0.39391366 1
## 60 0.763745511 0.36414456 1
## 61 0.425271056 0.78846814 1
## 62 0.662841910 0.73856696 1
## 63 0.384085254 0.55445714 1
## 64 0.558712669 0.75126621 1
## 65 0.879896356 0.45663479 1
## 66 0.130927367 0.48334692 1
## 67 0.553595135 0.50196796 1
## 68 0.680771236 0.05204429 1
## 69 0.127966061 0.37342640 1
## 70 0.739712183 0.64602212 1
## 71 0.338359063 0.21828268 1
## 72 0.343835741 0.03412282 1
## 73 0.307959618 0.84937405 0
## 74 0.320364905 0.49873904 1
## 75 0.379980090 0.22769280 1
## 76 0.812421255 0.90316739 1
## 77 0.858120196 0.28602949 0
## 78 0.703462673 0.68164390 1
## 79 0.937758762 0.56573134 1
## 80 0.391415598 0.12947586 1
## 81 0.101512408 0.17192509 1
## 82 0.792423530 0.17843453 1
## 83 0.936878131 0.39919860 1
## 84 0.413874270 0.76720697 1
## 85 0.987793410 0.39508023 1
## 86 0.946944396 0.25834206 0
## 87 0.728985428 0.56984491 1
## 88 0.835279047 0.51134936 1
## 89 0.522861788 0.62924206 1
## 90 0.650369765 0.79743108 1
## 91 0.875004860 0.65072493 1
## 92 0.794561896 0.04576638 1
## 93 0.380683979 0.01460950 1
## 94 0.335860675 0.56088856 1
## 95 0.682850689 0.12609909 1
## 96 0.973856049 0.00358598 1
## 97 0.299150000 0.12330233 1
## 98 0.278790976 0.91354638 1
## 99 0.146340450 0.54135608 1
## 100 0.491211016 0.02464513 1
plot(X[,1:2],pch=19,col=X$b+1)
nb = sum(X$b == 1)
title(main = paste(nb,"bulbes become flowers!"))
legend("bottomleft",legend = c('Failed','Bloomed'),col=c(1,2),pch=19)
The graphical system is very complex and works in Lego fashion, i.e. it is not possible to remove something on plot wihout clearing everything!
Of course, it is possible to add on a figure texts, legend, title, etc … and to save the figure in an image format (png, jpg, pdf, eps, …)
The window of the plot is automatically set up according to the data scale.
x = runif(100)
y = rnorm(100)
plot(x,y,pch=19)
It is possible to resize the window by using the xlim and ylim arguments.
plot(x,y,pch=19,xlim=c(-1,2),ylim=c(-3,4))
points(1.5,0,col='red')
points(0,3,col='blue')
plot(x[x >= 0.5],y[x >= 0.5],pch=19,
xlab='x',ylab='y',main='My title')
title(sub='My subtitle')
legend("bottomright",legend = 'my points',col=1,pch=19)
text(x,y,1:length(x >= 0.5),pos = 1)
To draw some line (for instance for linear regression), we use the abline function:
plot(x,x+rnorm(100,0,0.2),pch=19,ylab='y')
abline(a=0,b=1,lty=2,lwd=2,col='red')
abline(v = 0.5,lty=3,col='blue')
It is also possible to add texts in the margin:
plot(x,x+rnorm(100,0,0.2),pch=19,ylab='y')
abline(a=0,b=1,lty=2,lwd=2,col='red')
abline(v = 0.5,lty=3,col='blue')
mtext("A nice linear model",side = 4)
Histograms are a way to look at the distribution of the data.
x = rnorm(200,mean = 2,sd = 1.5)
hist(x,col='lavender')
Be careful with histograms: it is not so easy to draw a great histogram! In addition, keep in mind that it is possible to manipulate a histogram to pass a specific message!
For example, this is the default result for a uniformly distributed variable:
x = runif(100,0,1)
hist(x)
And this are other possible histograms for the same data:
par(mfrow=c(1,2))
hist(x,breaks=2)
hist(x,breaks=100)
It is possible to combine or replace the histogram with a plot o the estimated density function. The function to estimate the density function is density:
x = c(rnorm(100,2,1),rgamma(50,shape = 1))
hist(x,freq = FALSE)
f = density(x)
lines(f,lwd=2,col='red',lty=2)
It is possible to compare the empirical distribution of the data with the theorical function of some specific distribution:
x = c(rnorm(100,2,1),rgamma(50,shape = 1))
qqnorm(x)
x = c(rnorm(100,2,1),rgamma(50,shape = 1))
qqplot(x,y = rgamma(1000,shape=1))
Perhaps, the best way to look at the distribution of some data is to do a boxplot. Boxplots are great becaus the recipie is unique!
par(mfrow=c(1,2))
boxplot(rnorm(100),title='Normal')
boxplot(runif(100),title='Uniform')
par(mfrow=c(1,2))
boxplot(rnorm(100),title='Normal')
boxplot(rgamma(100,shape=1),title='Gamma')
But, it is a very simple plot and it is not easy to look at complex distributions:
x = c(rnorm(100,2,1),rgamma(50,shape = 1))
boxplot(x,title='Mixture dist.')
Interestingly, it is possible to display multavariate data with boxplots:
x = iris[,-5]
boxplot(x)
This is very helpful but limited to data on the same unit!
To represent multivariate data, we also have the pairplot funtion.
x = iris[,-5]
pairs(x)
Remark: Keep in mind that some information can remain hidden in high-dimensional spaces!
It is possible to add categorical variables on a plot by changing the colors or the forms of the points.
x = iris[,-5]
y = iris$Species
pairs(x,col=as.numeric(y),pch=as.numeric(y))
Finally, for categorical variables, you can draw pie plots:
y = iris$Species
pie(summary(y))
We are going to use an extra package, ggplot, that can be download on the CRAN repository.
We are going to use the mpg data set to illustrate the use of the package:
library(ggplot2)
?mpg
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl cla~
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <ch>
## 1 audi a4 1.8 1999 4 auto~ f 18 29 p com~
## 2 audi a4 1.8 1999 4 manu~ f 21 29 p com~
## 3 audi a4 2 2008 4 manu~ f 20 31 p com~
## 4 audi a4 2 2008 4 auto~ f 21 30 p com~
## 5 audi a4 2.8 1999 6 auto~ f 16 26 p com~
## 6 audi a4 2.8 1999 6 manu~ f 18 26 p com~
## 7 audi a4 3.1 2008 6 auto~ f 18 27 p com~
## 8 audi a4 q~ 1.8 1999 4 manu~ 4 18 26 p com~
## 9 audi a4 q~ 1.8 1999 4 auto~ 4 16 25 p com~
## 10 audi a4 q~ 2 2008 4 manu~ 4 20 28 p com~
## # ... with 224 more rows
The ggplot package has a very specific wayt of working:
ggplot(data = mpg) + geom_histogram(aes(x = hwy))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notice that
geom_histogramalways use 30 bins for making the plot! It is necessary to provide a better value if you want!
ggplot(data = mpg) + geom_histogram(aes(x = hwy), bins=50, fill = 'pink')
ggplot(data = mpg) + geom_histogram(aes(x = hwy, y = ..density..), bins=50, fill = 'pink') + geom_density(aes(x = hwy),col = 'green')
It is also possible to draw boxplots:
boxplot(mpg$hwy)
ggplot(data = mpg) + geom_boxplot(aes(x = '',y = hwy), fill = 'lightblue')
Let’s compare the distribution of the hwy variable with one normal distribution with the quantile-quantile plot (qqplot for friends!):
ggplot(data = mpg) +
geom_qq(aes(sample = hwy)) +
geom_qq_line(aes(sample = hwy))
ggplot(data = mpg) +
geom_qq(aes(sample = hwy),distribution = stats::qgamma,dparams = (shape=1)) +
geom_qq_line(aes(sample = hwy),distribution = stats::qgamma,dparams = (shape=1))
For categorical data, we also have some barplots:
ggplot(data = mpg) + geom_bar(aes(x = trans))
It is also possible to have some original graphics that are not easy to obtain with the usual plot function, like this one:
ggplot(data = mpg) + geom_bar(aes(x = "", fill = factor(trans)))
To draw a barplot:
ggplot(data = mpg) + geom_bar(aes(x = "", fill = factor(trans))) + coord_polar(theta = "y")
For two-dimensional data, we have also specific plots:
ggplot(data = mpg) + geom_point(aes(x = displ,y = hwy))
ggplot(data = mpg) + geom_point(aes(x = displ,y = hwy,col = class, shape = factor(year)))
It is also possible to display the data conditionnaly on one categorical variable (here the class variable):
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, col = class)) + facet_wrap( ~ class, nrow = 3)
It is even possible to make the plots conditionnaly on two categorical variables:
ggplot(data = mpg) + geom_point(aes(x = displ, y = hwy)) + facet_grid(drv ~ cyl)
Let’s try to do the same without ggplot:
par(mfrow=c(3,4))
for (m in unique(mpg$drv))
for (v in unique(mpg$cyl)){
sel = which(mpg$drv == m & mpg$cyl == v)
if (length(sel)>0) plot(mpg$displ[sel],mpg$hwy[sel],pch=19)
else plot(0,0,type='n')
}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy),method = 'lm')
If one wuld like to interpret the coefficients of the lm model, it is necessary to use directrly the lm function:
out = lm(mpg$hwy ~ 1 + mpg$displ)
out
##
## Call:
## lm(formula = mpg$hwy ~ 1 + mpg$displ)
##
## Coefficients:
## (Intercept) mpg$displ
## 35.698 -3.531
plot(out)
plot( mpg$displ, mpg$hwy, type='p')
abline(out, col='red')
It is worth noticing that the ggplot graphic is more meaningful regarding the distribution of the residuals.
When working with multivariate categorical data, it still possible to have nice plots!
table(mpg$cyl,mpg$drv)
##
## 4 f r
## 4 23 58 0
## 5 0 4 0
## 6 32 43 4
## 8 48 1 21
ggplot(data = mpg) +
geom_bar(aes(x = factor(cyl), fill = factor(drv)))
ggplot(data = mpg) +
geom_bar(aes(x = factor(drv), fill = factor(cyl)))
There are variants of this plot, that can be obtained using the position option.
ggplot(data = mpg) + geom_bar(aes(x = factor(cyl), fill = factor(drv)), position = "dodge")
ggplot(data = mpg) + geom_bar(aes(x = factor(cyl), fill = factor(drv)), position = "fill")
ggplot(data = mpg) +
geom_bar(aes(x = "", fill = factor(cyl)), position = "fill") +
coord_polar(theta = "y") +
facet_wrap(~ factor(drv))
Remark: in the two last plots, information about the counts are missing and this could be dangerous when intyerpreting…
tapply(mpg$displ, mpg$cyl, mean)
## 4 5 6 8
## 2.145679 2.500000 3.408861 5.132857
ggplot(data = mpg) + geom_boxplot(aes(x = factor(cyl), y = displ))
ggplot(data = mpg) + geom_histogram(aes(x = displ)) + facet_wrap(~ factor(cyl), ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = mpg) + geom_density(aes(x = displ, col = factor(cyl)))
Exercise: Use the
starwars data set in thedplyr` package to:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
?starwars
starwars$name[starwars$species == 'Human']
## [1] "Luke Skywalker" "Darth Vader" "Leia Organa"
## [4] "Owen Lars" "Beru Whitesun lars" "Biggs Darklighter"
## [7] "Obi-Wan Kenobi" "Anakin Skywalker" "Wilhuff Tarkin"
## [10] "Han Solo" "Wedge Antilles" "Jek Tono Porkins"
## [13] "Palpatine" "Boba Fett" "Lando Calrissian"
## [16] "Lobot" "Mon Mothma" "Arvel Crynyd"
## [19] "Qui-Gon Jinn" "Finis Valorum" NA
## [22] NA "Shmi Skywalker" "Mace Windu"
## [25] "Gregar Typho" "Cord<U+00E9>" "Cliegg Lars"
## [28] "Dorm<U+00E9>" "Dooku" "Bail Prestor Organa"
## [31] "Jango Fett" "Jocasta Nu" NA
## [34] "Raymus Antilles" NA "Finn"
## [37] "Rey" "Poe Dameron" NA
## [40] "Padm<U+00E9> Amidala"
unique(starwars$homeworld)
## [1] "Tatooine" "Naboo" "Alderaan" "Stewjon"
## [5] "Eriadu" "Kashyyyk" "Corellia" "Rodia"
## [9] "Nal Hutta" "Bestine IV" NA "Kamino"
## [13] "Trandosha" "Socorro" "Bespin" "Mon Cala"
## [17] "Chandrila" "Endor" "Sullust" "Cato Neimoidia"
## [21] "Coruscant" "Toydaria" "Malastare" "Dathomir"
## [25] "Ryloth" "Vulpter" "Troiken" "Tund"
## [29] "Haruun Kal" "Cerea" "Glee Anselm" "Iridonia"
## [33] "Iktotch" "Quermia" "Dorin" "Champala"
## [37] "Geonosis" "Mirial" "Serenno" "Concord Dawn"
## [41] "Zolan" "Ojom" "Aleen Minor" "Skako"
## [45] "Muunilinst" "Shili" "Kalee" "Umbara"
## [49] "Utapau"
#levels(as.factor(starwars$homeworld))
weight = tapply(starwars$mass, starwars$species, mean)
height = tapply(starwars$height, starwars$species, mean)
Species = data.frame(weight,height)
Species
## weight height
## Aleena 15.0 79.0000
## Besalisk 102.0 198.0000
## Cerean 82.0 198.0000
## Chagrian NA 196.0000
## Clawdite 55.0 168.0000
## Droid NA NA
## Dug 40.0 112.0000
## Ewok 20.0 88.0000
## Geonosian 80.0 183.0000
## Gungan NA 208.6667
## Human NA NA
## Hutt 1358.0 175.0000
## Iktotchi NA 188.0000
## Kaleesh 159.0 216.0000
## Kaminoan NA 221.0000
## Kel Dor 80.0 188.0000
## Mirialan 53.1 168.0000
## Mon Calamari 83.0 180.0000
## Muun NA 191.0000
## Nautolan 87.0 196.0000
## Neimodian 90.0 191.0000
## Pau'an 80.0 206.0000
## Quermian NA 264.0000
## Rodian 74.0 173.0000
## Skakoan 48.0 193.0000
## Sullustan 68.0 160.0000
## Tholothian 50.0 184.0000
## Togruta 57.0 178.0000
## Toong 65.0 163.0000
## Toydarian NA 137.0000
## Trandoshan 113.0 190.0000
## Twi'lek NA 179.0000
## Vulptereen 45.0 94.0000
## Wookiee 124.0 231.0000
## Xexto NA 122.0000
## Yoda's species 17.0 66.0000
## Zabrak NA 173.0000
ggplot(data = Species) + geom_point(aes(x = weight,y = height)) +
geom_text(aes(x = weight,y = height,label = rownames(Species)),nudge_y = 10)
## Warning: Removed 12 rows containing missing values (geom_point).
## Warning: Removed 12 rows containing missing values (geom_text).
Species$count = table(starwars$species)
Species = Species[order(Species$count,decreasing = TRUE),]
#ggplot(data = Species) + geom_bar(aes(x = count))
barplot(Species$count)
ggplot(data = starwars) +
geom_point(aes(x = mass,y = height)) +
geom_smooth(aes(x = mass,y = height),method = 'lm')
## Warning: Removed 28 rows containing non-finite values (stat_smooth).
## Warning: Removed 28 rows containing missing values (geom_point).
Remark: we can see here that the linear model is very sensitive to ouliers!
dplyr